(57) A data processing system for performing multiplication, e.g. modular multiplication as used in
implementing the Montgomery Reduction Algorithm, includes a multiplier 20 which simultaneously performs

DATA PROCESSING SYSTEM FOR PERFORMING MULTIPLICATION
AND MULTIPLICATION METHOD

FIELD OF THE INVENTION

This invention relates generally to a data processing
system for performing multiplication using a
multiplier, and more particularly for performing
modular multiplication such as used in implementing the
Montgomery Reduction Algorithm.

BACKGROUND OF THE INVENTION

Modular multiplication is extensively used in
implementing cryptographic methods such as RSA
cryptography.

The Montgomery algorithm is one of the most efficient
techniques for performing modular multiplication. Its
use is particularly effective where high performance is
required so as to minimise the computation time.

The Montgomery proof is given in Appendix 1 and the
Montgomery Reduction Algorithm is outlined below:

Montgomery Algorithm

To enact the P operator on A.B we follow the process
outlined below:

(1) X = A.B + S           (S initially zero)

(2) Y = (X.J) mod 2^n     (where J is a pre-calculated
                          constant, and n is the number
                          of bits in N)

(3) Z = X + Y.N

(4) S = Z/2^n

(5) S = S (mod N)         (N is subtracted from S if S >= N;
                          else S = S)

Thus S = P = P(A.B)N (the result in the Montgomery
Field of numbers)
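Steps (1) to (5) can be sketched in Python as follows. This is an illustrative sketch only, not the co-processor's implementation: the function name is hypothetical, and it assumes J = -N^(-1) mod 2^n (the text says only that J is a pre-calculated constant), which is the choice that makes X + Y.N exactly divisible by 2^n. It also assumes A and B are smaller than N, so that a single subtraction suffices in step (5).

```python
def montgomery_product(a, b, n_mod, n_bits):
    """Return P(A.B)N = A.B.2^(-n) mod N, following steps (1) to (5)."""
    if n_mod % 2 == 0:
        raise ValueError("N must be odd")
    r = 1 << n_bits
    # Assumption: J = -N^(-1) mod 2^n, so that X + Y.N is a multiple of 2^n.
    j = (-pow(n_mod, -1, r)) % r   # pow with exponent -1 needs Python 3.8+
    s = 0                          # S initially zero
    x = a * b + s                  # (1)
    y = (x * j) % r                # (2)
    z = x + y * n_mod              # (3)
    s = z >> n_bits                # (4) exact division by 2^n
    if s >= n_mod:                 # (5)
        s -= n_mod
    return s
```

For example, with the small odd modulus N = 239 (n = 8), `montgomery_product(100, 200, 239, 8)` agrees with `100 * 200 * pow(2, -8, 239) % 239`.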

in financial applications where smartcards are used as a
means of ensuring a high level of security during the
transaction. Public key cryptography is becoming
increasingly popular. Public key cryptography offers a
higher level of protection than the traditional symmetric
or private key methods, but until recently has been
expensive to implement. Advances in technology have now
made the implementation of such methods cost effective.
RSA Public Key capability has been designed into smartcard
microcontrollers which also include an on-chip
co-processor which has been specifically designed to perform
modular multiplications for operands each of 512 bit
length. The co-processor is directly driven by the
microcontroller's CPU under software control by a program
stored either in ROM or in EEPROM. Such a co-processor
which implements the Montgomery algorithm for modular
reduction without the division process is known from
European Patent Publication EP-0601907-A.

In performing the multiplications of equations (1) and
(3) above, two separate 32 x 1 bit multipliers (ML1 and
ML2 of FIG. 1) are used. Thus, to perform a 512 x 512
bit multiplication, 16 rotations are necessary, with an
approximate consumption of 544 clock cycles for each
rotation. Furthermore, the operands are derived from
values B, S, and N received from fixed length shift
registers. Accordingly, such an architecture will be
unable to efficiently perform multiplications on
operands of greater than 512 bits. The silicon area
required to expand the architecture to accept larger
operands is prohibitive. Furthermore, there is an
increasing demand for faster cryptographic operations
such that 8000+ clock cycles to perform a 512 x 512
multiplication is unacceptable.

Therefore, a need exists for an improved data
processing system which is capable of performing
multiplication in fewer clock cycles than in prior art
systems. Further, it is desirable that such a system
be readily adaptable to operands of greater length
while requiring minimal silicon area to implement.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a block schematic diagram of a known,
prior art co-processor for performing modular
multiplication to implement the Montgomery
Reduction Algorithm;

FIG. 2 shows a block schematic diagram of an
improved data processing system for performing
multiplication, e.g. modular multiplication as is
used to implement the Montgomery Reduction
Algorithm;

FIG. 3 shows a simplified block schematic diagram
demonstrating the inputs and outputs of the
multiplier used in the data processing system of
FIG. 2;

FIG. 4 is a chart demonstrating the values added
by the multiplier used in the data processing
system of FIG. 2, when used to perform modular
arithmetic for implementing the Montgomery
Reduction Algorithm using four operands;

FIG. 5 shows a block schematic diagram of a
portion of the multiplier used in the data
processing system of FIG. 2 which demonstrates one
embodiment for selecting which values are to be
added in performing the multiplication of FIG.



DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
Co-processor Operation

FIG. 1 shows a diagram of a known, prior art hardware
implementation of a data processing system, in the form of
a co-processor, which performs the Montgomery algorithm
for both full-mode 512 bit and half-mode 256 bit operands.

The diagram shows the execution unit which comprises
basically three 512 bit clocked shift registers (Shift
Registers B, S, and N) and two parallel-serial
multipliers (ML1 and ML2).

The B value and the modulus N are preloaded into the B and
N registers respectively. Register S is used to store the
intermediate result after each rotation of approximately
544 clock cycles (512 to perform the multiplication and
another 32 to shift the last 32 bit value out of the
register). Initially the S register will be cleared. The
pre-calculated Montgomery Constant, J0, is loaded into the
co-processor via a 32 bit shift register and latched in
Latch2.

The A value is shifted in 4 bytes (32 bits) at a time
(Ai) via multiplexer M2_l;2 and latched in Latch1. The
value in the B register is serially clocked one bit at a
time into a first parallel-serial multiplier ML1. The
output of this multiplier, at its output node, is the value Ai.B.
The value Ai.B is then summed at adder AD1 to the
intermediate value stored in register S to produce the
value X = Ai.B + S, at node ns.

For the first 32 clock cycles, the first 32 bit portion of
the X value is fed via multiplexer M3_l;4 into a second
parallel-serial multiplier ML2, where it is multiplied
by the value J0. The output from ML2 at node no is the
value Y0 = X.J0. Y0 is fed back through a 32 bit shift
register and latched in Latch2 via multiplexer M.

After the first 32 clock cycles, multiplexer M3_l;4
switches and feeds the modulus N into the multiplier ML2,
where N is multiplied by Y0 to produce the value Y0.N.
This value is then summed, over the next 544 clock cycles,
with X at adder AD2 to produce the value Z = X + Y0.N. The
least significant 32 bits of this calculation are zero and
only the 512 most significant bits are saved back in the S
register. This completes one full rotation.

Sixteen rotations, using a 32 bit multiplication, are
required to perform the full 512 bit by 512 bit
multiplication, which gives:

P = A.B.I (mod N) = P(A.B)N (the result in the Montgomery
Field of numbers).

To recover the required result, P is multiplied by H (a
pre-calculated Montgomery constant) to give the result in
the field of real numbers:

R = A.B (mod N) = P(P.H)N
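The sixteen-rotation data flow described for FIG. 1 can be modelled in software as below. This is a behavioural sketch, not a description of the actual circuit timing: the function name is hypothetical, J0 is assumed to be -N^(-1) mod 2^32, and B is assumed to be smaller than N so that one final subtraction brings the result below N.

```python
def montgomery_word_serial(a, b, n_mod, total_bits=512, word_bits=32):
    """Model of the rotations: one word Ai of A is consumed per rotation."""
    r = 1 << word_bits
    mask = r - 1
    j0 = (-pow(n_mod, -1, r)) % r            # assumed sign convention for J0
    s = 0                                    # S register initially cleared
    for i in range(total_bits // word_bits): # 16 rotations for 512/32
        a_i = (a >> (i * word_bits)) & mask  # next 32 bits of A
        x = a_i * b + s                      # X = Ai.B + S   (ML1 and AD1)
        y0 = ((x & mask) * j0) & mask        # Y0 from the first 32 bits of X
        z = x + y0 * n_mod                   # Z = X + Y0.N   (ML2 and AD2)
        s = z >> word_bits                   # low 32 bits are zero; keep the rest
    return s - n_mod if s >= n_mod else s
```

With an odd 512 bit modulus N, the returned value equals A.B.I (mod N) where I is the inverse of 2^512 modulo N.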

Various improvements have been made to the co-processor
architecture shown in FIG. 1. For example, instead of
using a single serial loop clocking stream, the
architecture can be modified to use bit-pair
multiplication, addition, and subtraction, where the
adders, subtracters, and parallel-serial multipliers
compute results two bits at a time. This improvement
immediately doubles the performance for the same clock
frequency. For further explanation of this improvement,
refer to GB Application No. 9622719.4.

A further improvement is achieved by replacing the 512 bit
clocked, serial shift registers with a combination of
random access memory (RAM), a parallel to serial
interface, and an 8 bit clocked shift register. The
advantages of such a replacement include 1) reduced power
consumption, 2) greater flexibility in handling varying
operand lengths, and 3) the ability to perform
exponentiation without intervention by the CPU. Further
explanation of the RAM-based improvement can be found in
GB Application No. 9622714.5.

Yet another improvement to the architecture shown in
FIG. 1 is the use of a single 64 bit clocked serial
multiplier to calculate A.B and N.Y on alternate clock
phases but still within a single clock cycle. Using
a single multiplier increases performance (only half the
number of rotations are needed) without increasing the
clock frequency, while consuming only slightly more
silicon area than using two 32 x 2 bit multipliers.
Further description of this single
multiplier implementation can be found in GB Application
No. 9701958.2.

Despite the various improvements which have been
proposed to the co-processor architecture of FIG. 1, there
continues to be a need for more efficient multiplication.

Referring now to FIG. 2, an improved data processing
system 10 for performing multiplication is shown in the
form of an integrated circuit co-processor suitable
for inclusion in a smartcard. System 10 includes random
access memory (RAM) portions 12, 14, 16, and 18 for
storing values of B, N, A, and S, respectively, as shown.
For purposes of describing the invention, the assumption
is made that B, N, A, and S values are 512 bits; however,
the invention is not limited to this particular size. In
such an embodiment, RAM portions 12-18 are preferably 512
bit arrays, but could be 1024 bit arrays or larger.

System 10 further includes a multiplier 20. In a preferred
embodiment, multiplier 20 is a parallel multiplier. If
values A, B, and N are 512 bits, multiplier 20 is
preferably a 16 row by 17 column (hereinafter, 16 x 17)
multiplier which performs 16 bit by 16 bit multiplication.
In performing a multiply operation of 16 bit x 16 bit,
multiplier 20 should accommodate a 33 bit wide result (the
extra bit is needed to accommodate the accumulation of A+Y
as explained further below). Alternatively, multiplier 20
could be a 32 x 33 bit multiplier for multiplying 32 bits
of each value at a time, or even larger, with a
correspondingly larger accumulator/output stage. However,
for the remainder of the description it is assumed that the
data processing system is used to perform 16 bit
multiplication.

Multiplier 20 is used to simultaneously multiply A.B and
Y.N and to sum the results in accordance with the present
invention. To achieve this, as explained further below,
values J, A, Y and A+Y are input into the multiplier via a
register 22. The B value is fed to the multiplier through
a first subtracter, Sub1, and a first multiplexer, Mx1,
the purposes of which are explained further below. The N
value is fed to the multiplier through a second
multiplexer, Mx2. Mx1 and Mx2 are sized according to the
size of the multiplier.

Data processing system 10 also includes a second
subtracter, Sub2, for subtracting the value N from S (see
equation (5)). An adder, Add1, is included to calculate Z
as the sum of the output of multiplier 20 (A.B + Y.N) and
S (see equation (3)). Add1 is also used to calculate A+Y
as explained further below. Borrow detection circuitry 24
is used to determine whether N is to be subtracted from S,
and thus is used to enable or disable Sub1 and Sub2 as
indicated.

In performing the Montgomery Reduction Algorithm, data
processing system 10 works as follows. J, having a bit
width which matches the width of multiplier 20 (in this
case a 16 bit value), is a precalculated Montgomery Constant
which is loaded into register 22 from the central
processing unit (CPU) of the system via a data bus
interface 26. The J value remains constant throughout
each 512 x 512 multiplication and is held in a 16 bit
register field. Value A is latched into register 22 16
bits at a time from node n2 (thus A is denoted in the
figures as Ai), which receives the value 16 bits at a time
directly from RAM portion 16.

Value Y must be calculated as in equation (2) for each
A value loaded into register 22 (thus Y is denoted in
the figures as Yi). Accordingly, X must be calculated to
be able to determine the product X.J. From equation (1),
X = A.B + S. Only 16 bits of Y are needed, therefore 16
bits of A and 16 bits of B are multiplied to achieve Y.
[illegible] fed the 16 least significant bits
of the A.B result. By setting the Mx2 output to be the
"0" input, multiplier 20 calculates A.B. Value A is
received as described above. Value B is received from RAM
portion 12 with or without subtraction by Sub1 as
determined by borrow detection circuitry 24. The output of
the multiplier at n3 is thus A.B. The 16 LSBs of this
result are fed to Add1. S is added to this value
to produce A.B + S at node n1. This node is fed to
Mx1 such that on the next clock cycle, n1 rather than a B
value is fed into multiplier 20. Multiplier 20 then
calculates the value at n1 (A.B + S) times J (from
register 22). The 16 LSBs of the product X.J are then fed
to register 22 as the value Yi of FIG. 2.

The value A+Y is calculated in the same clock cycle that
Yi is fed to register 22 by using Add1 to add Ai to the Yi
value at n3 to produce Ai+Yi at node n1. The n1 value is
then fed to register 22. Note that the register values
for each of J, A, and Y are 16 bits wide, while that of
A+Y is 17 bits wide because the addition of two 16 bit
values may produce a 17 bit value.

With the implementation illustrated and described, it will
take approximately 7 clock cycles to load J, A, Y, A+Y
into register 22 before multiplication of A.B + Y.N can
begin. Once the values in register 22 are set, data
processing system 10 operates to determine Z. From
equations (1) and (3), Z = S + A.B + Y.N. In accordance
with the present invention, A.B + Y.N is calculated
simultaneously using a single parallel multiplier.

As represented by FIG. 3, the N value and B value are
input into multiplier 20 as the two multiplicands of Ai.B +
Yi.N. A and Y values are input as the multipliers of A.B +
Y.N. In accordance with the present invention, the
additional input of A+Y is input into multiplier 20 to
perform the operation A.B + Y.N. In its simplest form,
the operation can be broken down into a series of
additions and shifts. When a bit of B is "0", nothing is
added to the accumulated result of the multiplier. When a
bit of B is "1", A is added to the accumulated result of
the multiplier. Similarly, when a bit of N is "0",
nothing is added to the accumulated result of the
multiplier. When a bit of N is "1", Y is added to the
accumulated result of the multiplier. When the
corresponding bits of B and N are both "1", A and Y must
both be added.
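The add-and-shift view of A.B + Y.N can be illustrated with a short Python sketch. The function name and the dictionary lookup are illustrative choices, not part of the described hardware; the only thing taken from the text is the rule that each pair of corresponding B and N bits selects one of 0, A, Y, or the precalculated A+Y.

```python
def dual_multiply_accumulate(a, b, y, n, width):
    """Compute A.B + Y.N in one pass by selecting 0, A, Y or A+Y per bit pair."""
    a_plus_y = a + y  # precalculated once; may be one bit wider than A or Y
    addend = {(0, 0): 0, (1, 0): a, (0, 1): y, (1, 1): a_plus_y}
    acc = 0
    for k in range(width):
        b_bit = (b >> k) & 1
        n_bit = (n >> k) & 1
        # One addition per bit position, shifted to that position's weight.
        acc += addend[(b_bit, n_bit)] << k
    return acc
```

Because only one addend is selected per bit position, a single accumulation pass yields the sum of both products; the result always equals `a * b + y * n` for operands that fit in `width` bits.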
In accordance with one embodiment of the invention, B and
N are fed to multiplier 20 16 bits at a time and are
multiplied with 16 bits of A and Y as described above.
Following this multiplication, the next 16 bits of B and N
are fed to multiplier 20 and are multiplied with the same
16 bits of A and Y already stored in register 22. The
process continues until all 512 bits of B and N have been
multiplied by 16 bits of A and Y. This completes one
rotation of the multiplication. In total the rotation
takes approximately 43 clock cycles: about 7 to load
register 22; 32 to compute Ai.B + Yi.N; and about 4 to get
the data through the various adders and subtracters.

After the completion of one rotation, the next 16 bits of
A are fed to register 22 and Y and A+Y are recalculated
using the new A value as described above. J is unchanged.
After the new register values are set, B and N are again
fed to multiplier 20 16 bits at a time, until all 512 bits
of B and N have been multiplied using the new register
values. This completes a second rotation. Rotations are
repeated until all 512 bits of A have been used, resulting
in a total of 32 rotations for a 512 x 512 multiplication,
which corresponds to about 1376 clock cycles. This
performance is a 7x improvement over the prior
implementation described in reference to FIG. 1 and a 3.5x
improvement over the use of two 2x32 multipliers. If the
present invention is implemented using a 32x33 parallel
multiplier, only 432 clock cycles are needed to perform a
512 x 512 multiplication because only 16 rotations of
approximately 27 clock cycles are required.
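The clock counts quoted above follow from simple arithmetic, sketched here in Python. The function and parameter names are hypothetical, and the sketch assumes that the roughly 7 load clocks and roughly 4 pipeline clocks stay the same when the multiplier width changes, which is consistent with the 43 and 27 clock figures in the text.

```python
def clocks_per_multiplication(operand_bits=512, word_bits=16,
                              load_clocks=7, flush_clocks=4):
    """Return (rotations, clocks per rotation, total clocks)."""
    chunks = operand_bits // word_bits       # words of B and N per rotation
    rotations = operand_bits // word_bits    # words of A, one per rotation
    per_rotation = load_clocks + chunks + flush_clocks
    return rotations, per_rotation, rotations * per_rotation


print(clocks_per_multiplication(word_bits=16))  # (32, 43, 1376)
print(clocks_per_multiplication(word_bits=32))  # (16, 27, 432)
```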

Throughout a 512 x 512 multiplication, an intermediate S
value is generated. Initially, S=0. After the first 16 x
16 multiplication of A.B + Y.N, the 16 LSBs of the
intermediate result are fed to Add1 and added to the
last S value. Mx3 controls what is added at Add1. As
described above, Mx3 initially passes the next A value to
Add1 to produce the A+Y value stored in register 22.
During the multiplication of A.B and Y.N, Mx3 then passes
the previous S value from S-RAM portion 18 to Add1. The
result from Add1 is fed to S-RAM portion 18 as a new S
value. Each time S is added at Add1, borrow detection
circuitry 24 is used to determine if the new value of S is
greater than N. If greater, borrow detection circuitry 24
enables Sub2 for the next rotation, and N is subtracted
from S before the value is added by Add1. After the
entire 512 x 512 multiplication has been performed, the
last S value is stored in B-RAM portion 12 because S is
sometimes used in subsequent calculations as the B value.
Again, however, N must be subtracted from this value if
S>N. Thus, Sub1 is included. Sub1 is enabled by borrow
detection circuitry 24 during such subsequent calculations
if the S value stored in B-RAM portion 12 is greater than
the N for the former calculation.



Also throughout the multiplication, the A, Y, A+Y and 0
values are accumulated in the multiplier. In accordance
with the present invention, the 17 most significant bits
(MSBs) of each 33 bit multiplication result are carried in
the accumulator of the multiplier, while the 16 LSBs are
fed to Add1 as explained above.
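One rotation, processed 16 bits of B, N and S at a time with the upper bits carried forward, can be modelled as below. This is a simplified software sketch under stated assumptions: the function name is hypothetical, J is assumed to be -N^(-1) mod 2^16, and S is folded into the same running sum as the two products, whereas in the described system S is added at Add1 after the multiplier.

```python
def rotation(a_i, b, n_mod, s, j, word_bits=16, words=32):
    """Return (S + Ai.B + Yi.N) / 2^word_bits, computed one word at a time."""
    mask = (1 << word_bits) - 1
    # Yi needs only the low word: X = Ai.B + S, Yi = X.J mod 2^word_bits.
    x_low = (a_i * (b & mask) + (s & mask)) & mask
    y_i = (x_low * j) & mask
    carry = 0
    result = 0
    for k in range(words):
        shift = k * word_bits
        b_k = (b >> shift) & mask
        n_k = (n_mod >> shift) & mask
        s_k = (s >> shift) & mask
        t = a_i * b_k + y_i * n_k + s_k + carry  # fits in 33 bits for 16 bit words
        result |= (t & mask) << shift            # low word leaves the multiplier
        carry = t >> word_bits                   # upper bits stay in the accumulator
        assert carry < (1 << (word_bits + 1))    # never more than 17 bits here
    # Fold in the final carry and any bits of S above the processed words.
    result += (carry + (s >> (words * word_bits))) << (words * word_bits)
    return result >> word_bits                   # the lowest word is zero
```

Calling `rotation` once per 16 bit word of A, starting from S = 0 and ending with one conditional subtraction of N, gives A.B.I (mod N) for a 512 bit odd modulus, with I the inverse of 2^512 modulo N.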

An example of the operation of multiplier 20 is shown
below in FIG. 4. An arbitrary B value of 00111100 is
supplied to the multiplier. An arbitrary N value of
10101010 is supplied to the multiplier. To perform A.B +
Y.N, the bits of B and N are examined to determine what
value is to be added to the accumulator result stage of
the multiplier. In this example, the LSBs of B and N are
"0" so nothing is added, as indicated in FIG. 4. The next
LSBs of B and N are "0" and "1" respectively, so Y is
added. The next are "1" and "0", respectively, so A is
added. The next are "1" and "1" so A and Y are both added
using the precalculated A+Y value. These examples
demonstrate the four possible combinations of B and N bit
comparisons. The addition of either 0, A, Y, or A+Y is
determined for each B-N bit pair, but is done in a single
clock cycle for whatever B & N size value is supplied to
the multiplier.
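The per-bit selection for the B and N values of this example can be listed with a few lines of Python. The function name and the string labels are illustrative only; the list is ordered from the least significant bit pair upwards.

```python
def addend_sequence(b, n, width):
    """List which of 0, A, Y or A+Y is selected for each bit pair, LSB first."""
    labels = {(0, 0): "0", (1, 0): "A", (0, 1): "Y", (1, 1): "A+Y"}
    return [labels[((b >> k) & 1, (n >> k) & 1)] for k in range(width)]


print(addend_sequence(0b00111100, 0b10101010, 8))
# ['0', 'Y', 'A', 'A+Y', 'A', 'A+Y', '0', 'Y']
```

The first four entries match the four cases walked through in the text.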

A more particular example is shown below in reference
to TABLE 1. For simplicity, only 4 bit arbitrary values
of B, N, A, Y will be used as follows. The number in
parentheses is the decimal equivalent of the binary value.

B   = 1001  (9)
N   = 1100  (12)
A   = 1010  (10)
Y   = 1101  (13)
A+Y = 10111 (23)

N  B  Add
0  1  A              1 0 1 0
0  0  0            0 0 0 0
1  0  Y          1 1 0 1
1  1  A+Y    1 0 1 1 1
             ---------------
             1 1 1 1 0 1 1 0

TABLE 1

Thus, A.B + Y.N equals 11110110, or 10.9 + 13.12 = 246.

In order to implement the operation of multiplier 20
as described above, circuitry is needed to control which
of A, Y, A+Y, or 0 is added. One embodiment for such
control is illustrated in FIG. 5. FIG. 5 is a portion of
the multiplier showing decode circuitry 30 associated with
each row of the multiplier. In one form, the decode
circuitry may comprise a 3:1 multiplexer 32 for each row
and a switch 34 for each bit. Multiplexers 32 are used to
control which of A, Y, or A+Y is active for each row (i.e.
which value is passed through each switch in the
respective row). The output of the multiplexer is
determined by the inputs of B and N bit values as
described above. If all of A, Y, and A+Y are inactive
(i.e. both the B and the N bit values are 0), 0 is passed.

It is noted that the circuitry described and shown in
reference to FIG. 5 is but one example of how a multiplier
could be implemented in accordance with the present
invention and should not be viewed as limiting the scope
of the invention.

As mentioned above, a 16x17 multiplier is preferred to
perform a 512 bit encryption (assuming 16 bit calculations
per clock). The 17th column of the multiplier is really
"overhead" that is needed because A+Y can produce a 17 bit
value. The area consumed by such a multiplier is
estimated to be about 2924 gate equivalents, calculated as
follows. An NxM multiplier would include (N-1)(M) full
adders, each with a gate equivalent of 8; M half adders,
each with a gate equivalent of 4; and NxM 3:1
multiplexers, each with a gate equivalent of 3. In
contrast, the implementation of two 2x32 serial
multipliers as done in the prior art is approximately 2700
gate equivalents. Thus, for a slight increase in silicon
area, the performance of the multiplier can be
significantly improved. Two 2x32 serial multipliers can
perform the 512 x 512 bit multiplications in approximately
4864 clocks (512/2 bits multiplied per clock + 48 overhead
clock cycles, for each of 16 rotations), whereas a 16x17
parallel multiplier in accordance with the present
invention requires only approximately 1376 clocks, as
explained above. If silicon area is available to use a
32x33 parallel multiplier in accordance with the present
invention, the calculation can be done in just 432 clocks
because only half as many rotations are needed.
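The area and clock estimates in this paragraph can be reproduced with the following arithmetic sketch. The function names are hypothetical; the gate-equivalent weights (8, 4 and 3) and the 48 overhead clocks are the figures quoted in the text.

```python
def gate_equivalents(rows, cols, full_adder=8, half_adder=4, mux3=3):
    """Estimate gate equivalents for a rows x cols parallel multiplier."""
    full_adders = (rows - 1) * cols
    half_adders = cols
    multiplexers = rows * cols
    return full_adders * full_adder + half_adders * half_adder + multiplexers * mux3


def serial_pair_clocks(operand_bits=512, bits_per_clock=2, overhead=48, rotations=16):
    """Clock estimate for the two 2x32 serial multipliers of the prior art."""
    return (operand_bits // bits_per_clock + overhead) * rotations


print(gate_equivalents(16, 17))   # 2924
print(serial_pair_clocks())       # 4864
```

Dividing 4864 by the 1376 clocks of the 16x17 parallel multiplier gives roughly the 3.5x improvement cited earlier.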

The foregoing description and illustrations contained
herein demonstrate many of the advantages associated with
the present invention. In particular, it has been
revealed that a multiplier can be used to perform the
entire operation of A.B + Y.N, as is needed in performing
the Montgomery Reduction Algorithm. The use of a single
parallel multiplier eliminates the need to increase clock
speed to improve performance. Thus, there is no increase
in power consumption by the data processing system, and
design difficulties experienced at higher clock
frequencies are avoided. Moreover, the present invention
can be implemented in a silicon area comparable to that
used in prior art processors which used two multipliers to
perform the same operation, with improved performance.
Furthermore, the present invention has significant silicon
area savings as compared to using two 16x16 parallel
processors, one to perform A.B and the other to perform
Y.N. The only additional hardware needed to perform both
operations with the same multiplier is an additional
column in the multiplier (to accommodate the accumulation
of A+Y) and some decode circuitry for each row to
determine which of A, Y, A+Y or 0 is to be added.
Moreover, the multiplier of the present invention can be
used to perform traditional multiplications such as A.B,
in addition to performing A.B + Y.N.

Thus it is apparent that there has been provided, in
accordance with the invention, a data processing system
having a multiplier and multiplication method using the
same that fully meets the need and advantages set forth
previously. Although the invention has been described and
illustrated with reference to specific embodiments
thereof, it is not intended that the invention be limited
to these illustrative embodiments. Those skilled in the
art will recognise that modifications and variations can
be made without departing from the scope of the invention.
For example, the present invention is not limited to any
particular size of multiplication, nor is it required that
the invention be used to implement the Montgomery
Algorithm. While the values A, B, N, and Y are used in
the description and claims, such designations are for
reference only; the present invention can be used to
perform any calculation in the general form A.B + C.D. In
addition, the invention is not limited to implementations
having a parallel multiplier. A serial multiplier could
be used, but would not have the performance of a parallel
multiplier as herein described. Also, in the instance
where Y is a function of the product of two other input
values (e.g. A.B), it is not required that the same
multiplier be used to calculate Y (to provide values Y and
A+Y as multiplier inputs) as is used to perform A.B +
Y.N. A separate multiplier could be used to calculate Y,
albeit with additional silicon consumption. Therefore, it
is intended that this invention encompass all such
variations and modifications as fall within the scope of
the appended claims.

Appendix 1

The Montgomery function P(A.B)N performs a
multiplication modulo N of the product A.B into the P
field. The retrieval from the P field back into the
normal modular field is performed by enacting P on the
result of P(A.B)N and a precalculated constant H.

Thus if P == P(A.B)N, then P(P.H)N == A.B (mod N).

Proof

We require to calculate R = A.B (mod N).

First find Q, such that:

    P.2^n = A.B + Q.N  (where N is odd)                  (1)

Note:

    I.2^n == 1 (mod N)  (and n is the bit length of N)   (2)

Multiply equation (1) by I to give:

    P.I.2^n = A.B.I + Q.I.N                              (3)

Consider the left side of (3), from (2):

    P.I.2^n == P (mod N)                                 (4)

Consider the right side of (3), then from (4):

    P == (A.B.I + Q.I.N) (mod N), and therefore:
    P == A.B.I (mod N) = P(A.B)N                         (5)

Consider P(P.H)N, then from (5):

    P(P.H)N == A.B.I^2.H (mod N)                         (6)

Clearly if H is defined as I^-2 then:

    R == P(P.H)N == A.B (mod N)                          (7)

Equation (7) gives the desired result.

From (2) above, H = 2^(2n) (mod N) and is a precalculated
constant depending only on N and n.

It next requires that Q be found. From (1) it can be seen
that:

    (A.B.I + Q.I.N) (mod 2^n) = 0                        (8)

This implies:

    A.B.I (mod 2^n) = -Q.I.N (mod 2^n), and therefore,

    Q == -N^-1.A.B (mod 2^n)                             (9)

For odd N, J = N^-1 such that N.J = I.2^n + 1.
Hence Q == -A.B.J (mod 2^n).

Note, J is also a precalculated constant depending only on
N and n.
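The recovery identity of the appendix, P(P.H)N == A.B (mod N), can be checked numerically with plain modular arithmetic. The sketch below is only a check of the identities, not an implementation of the reduction hardware; the function name is hypothetical, and H is taken as 2^(2n) mod N, which is I^-2 when I is the inverse of 2^n modulo N as in (2).

```python
def check_recovery(a, b, n_mod, n_bits):
    """Verify numerically that P(P.H)N == A.B (mod N) with H = I^-2."""
    i = pow(2, -n_bits, n_mod)       # I, from (2): I.2^n == 1 (mod N)
    h = pow(2, 2 * n_bits, n_mod)    # H = I^-2 = 2^(2n) (mod N)
    p = (a * b * i) % n_mod          # (5): P == A.B.I (mod N)
    r = (p * h * i) % n_mod          # (6): P(P.H)N == P.H.I (mod N)
    return r == (a * b) % n_mod      # (7)


print(check_recovery(100, 200, 239, 8))  # True
```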
CLAIMS

1. A data processing system for performing
multiplication comprising:

means for storing a plurality of binary data
values, A, B, N, and Y; and
a multiplier having inputs for receiving
values A, B, N, and Y from the
respective means for storing said
values, and further having an input for
receiving a value of A+Y, wherein said
multiplier computes an operation A.B +
N.Y.

2. The data processing system of claim 1 wherein the
multiplier is a parallel multiplier and is the
only multiplier in the data processing system used
to compute said operation.

3. The data processing system of claim 1 or 2 wherein
the multiplier is used to compute Y, which is a
function of a product of two of the other values.

4. The data processing system of claim 1, 2, or 3
wherein said values A, B, N, and Y are used to
perform the Montgomery algorithm.

5. The data processing system of any preceding claim
wherein said means for storing values B and N
comprises random access memory, and wherein said
means for storing values A and Y comprises a
register means, said register means also used for
storing the value A+Y.

6. The data processing system of any preceding claim
further comprising decode circuitry coupled to the
multiplier for selecting which values of A, Y,
A+Y, or 0 are accumulated within the multiplier.

7. An integrated circuit which embodies the data
processing system of any preceding claim.

8. A smartcard including an integrated circuit
embodying the data processing system of any
preceding claim.

9. A method for performing an operation A.B + Y.N
comprising the steps of:

providing a data processing system having
means for storing binary values A, B, Y,
and N, a multiplier, and adder means;
using said adder means to calculate a binary
value A+Y;
providing the values B and N as a first set
of inputs to the multiplier, and
providing values A, Y and A+Y as a
switchable second set of inputs to the
multiplier;
performing the operation A.B + Y.N using the
multiplier by selectively accumulating
either A, Y, A+Y, or 0 in the multiplier
depending upon each bit value of B and N
as input to the multiplier.

10. The method of claim 9 wherein the step of
performing the operation A.B + Y.N is a part of
performing the Montgomery Algorithm.

11. The method of claim 10 further comprising the step
of using the multiplier to calculate Y as a
function of a product of two of the other values.

12. An integrated circuit that performs the method of
any preceding claim.

13. A smartcard including the integrated circuit of
claim 11.
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